Bootstrapped Authorship Attribution in Compression Space - Notebook for PAN at CLEF 2012
Abstract
From a machine learning standpoint, the PAN 2012 Lab contest posed one major challenge: in all authorship attribution tasks, the number of training documents was extremely low. We extended our previous work, in which compression distances to randomly selected prototype documents from the training corpus were used as the feature representation and a supervised multi-class classifier was learned in the resulting feature space using the remaining documents. Inspired by the bootstrapped resampling method, we now drew document samples from the few source documents in order to obtain enough prototypes and samples to learn a supervised classifier. Using internal validation, we tuned the size of the document samples, the compression method, the distance measure, the classification method, and the decision threshold (for the open-class tasks) for optimal F1 score. We submitted this scheme to the closed-class and open-class author identification tasks. In the overall results for these tasks, we shared fourth place among the 11 teams, based on the reported average recall.
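The scheme in the abstract can be illustrated with a minimal sketch. The abstract does not say which compressor, distance measure, or sampling unit was finally chosen, so everything concrete here is an assumption made for illustration: zlib as the compressor, the normalized compression distance (NCD) as the distance, and sentences drawn with replacement as the bootstrap unit. The function names `ncd`, `draw_samples`, and `features` are hypothetical and are not taken from the paper.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (zlib is an assumed choice)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts (assumed measure)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def draw_samples(document: str, n_samples: int, n_sentences: int,
                 rng: random.Random) -> list[str]:
    """Draw bootstrap-style samples: each sample is built from sentences
    picked with replacement from one source document (assumed unit)."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [
        ". ".join(rng.choices(sentences, k=n_sentences))
        for _ in range(n_samples)
    ]


def features(sample: str, prototypes: list[str]) -> list[float]:
    """Represent a sample by its distances to all prototype samples."""
    return [ncd(sample, p) for p in prototypes]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for the few training documents per author.
    doc_a = "The cat sat on the mat. The cat ate the fish. The cat slept. " * 5
    doc_b = "Stocks fell sharply today. Markets reacted to rates. Bonds rose. " * 5
    # Some samples serve as prototypes, the rest would train a classifier.
    prototypes = draw_samples(doc_a, 2, 4, rng) + draw_samples(doc_b, 2, 4, rng)
    training = draw_samples(doc_a, 3, 4, rng)
    for sample in training:
        print([round(d, 3) for d in features(sample, prototypes)])
```

Each resulting vector has one distance per prototype, and any supervised multi-class classifier could then be trained on those vectors. The sample size (`n_sentences` here) is one of the parameters the abstract reports tuning by internal validation.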
Similar resources
Authorship Attribution using Compression Distances
Authorship attribution has been a field of interest for researchers in the past, especially for forensic purposes. In this thesis, to obtain the degree of Bachelor of Science from the Leiden University, we investigate character n-grams and so-called compression distances to prototypes on several datasets, i.e., the datasets provided by PAN Labs (a benchmarking activity on uncovering plagiarism,...
Authorship Identification in Large Email Collections: Experiments Using Features that Belong to Different Linguistic Levels - Notebook for PAN at CLEF 2011
The aim of this paper is to explore the usefulness of using features from different linguistic levels to email authorship identification. Using various email datasets provided by PAN’11 lab we tested several feature groups in both authorship attribution and authorship verification subtasks. The selected feature groups combined with Regularized Logistic Regression and One-Class SVM machine learni...
EPSMS and the Document Occurrence Representation for Authorship Identification - Notebook for PAN at CLEF 2011
This paper describes the participation of the PISIS team in the authorship identification track of PAN’11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mi...
Vote/Veto Meta-Classifier for Authorship Identification - Notebook for PAN at CLEF 2011
For the PAN 2011 authorship identification challenge we have developed a system based on a meta-classifier which selectively uses the results of multiple base classifiers. In addition we also performed feature engineering based on the given domain of e-mails. We present our system as well as results on the evaluation dataset. Our system performed second and third best in the authorship attribut...
Publication date: 2012